This markdown file is created by Sunah Park for extended lab exercises in the book An Introduction to Statistical Learning with Applications in R (ISLR).
library(MASS)
library(ISLR)
library(ggplot2)

summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
head(Boston,3)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
Boston<-na.omit(Boston)
sum(is.na(Boston$medv))
## [1] 0
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
The lm() function is used to fit a simple linear regression model.
lm.fit<-lm(medv~lstat, data=Boston) # y: medv(median house value for 506 neighborhoods in Boston), x: lstat(percent of households with low socioeconomic status) from Boston database
summary(lm.fit) # reports Std. Error, t value, p-value, R-squared, F-statistic
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
names(lm.fit) # Information stored in the linear model object
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
coef(lm.fit) # Coefficients of the linear model
## (Intercept) lstat
## 34.5538409 -0.9500494
confint(lm.fit) # Confidence intervals for the coefficient estimates
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
pred.lm.fit<-predict(lm.fit, data.frame(lstat=seq(1,50,5)), interval="confidence"); pred.lm.fit
## fit lwr upr
## 1 33.6037915 32.564024 34.643559
## 2 28.8535448 28.111211 29.595878
## 3 24.1032980 23.546025 24.660571
## 4 19.3530512 18.753385 19.952718
## 5 14.6028045 13.767222 15.438387
## 6 9.8525577 8.700886 11.004229
## 7 5.1023109 3.604296 6.600326
## 8 0.3520641 -1.505704 2.209832
## 9 -4.3981826 -6.622617 -2.173748
## 10 -9.1484294 -11.743514 -6.553345
pred.lm.fit2<-predict(lm.fit, data.frame(lstat=seq(1,50,5)), interval="prediction"); pred.lm.fit2 # Prediction intervals are wider than confidence intervals because they also account for the irreducible error of a single observation
## fit lwr upr
## 1 33.6037915 21.347614 45.859969
## 2 28.8535448 16.619011 41.088079
## 3 24.1032980 11.878597 36.327999
## 4 19.3530512 7.126344 31.579758
## 5 14.6028045 2.362259 26.843350
## 6 9.8525577 -2.413620 22.118735
## 7 5.1023109 -7.201218 17.405839
## 8 0.3520641 -12.000428 12.704556
## 9 -4.3981826 -16.811114 8.014749
## 10 -9.1484294 -21.633109 3.336250
The fitted model for simple linear regression: \[ {medv}=34.55-0.95\times{lstat} \]
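As a quick sanity check, the fitted equation can be applied by hand. This is a sketch using the intercept and slope printed by summary(lm.fit) above; predict() gives the same answer without manual copying of coefficients.

```r
# Hand-computed prediction for a neighborhood with lstat = 10,
# using the estimates from the summary output above (medv is in $1000s).
b0 <- 34.55384   # intercept estimate
b1 <- -0.95005   # slope estimate for lstat
b0 + b1 * 10     # 25.05334

# Equivalent, without retyping the coefficients:
predict(lm.fit, data.frame(lstat = 10))
```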
ggplot(data=Boston, aes(x=lstat, y=medv))+geom_jitter(alpha=.7, color="grey")+stat_smooth(method="lm", level=0.95)+theme_bw(base_family="Avenir")

df<-data.frame(x=Boston$lstat,
               y=Boston$medv,
               y.fit=lm.fit$fitted.values,
               residuals=lm.fit$residuals,
               i=seq_along(Boston$lstat))

When fitting a least squares line, we generally require the following conditions: the relationship is linear, the residuals are nearly normal, the variability of the residuals is nearly constant, and the observations are independent. The residual plots below check each of these in turn.
ggplot(df,aes(x=x,y=residuals))+geom_point(alpha=0.3)+geom_hline(yintercept=0, color="blue",linetype="solid",size=0.5)

Diagnose: there is some evidence of non-linearity.
ggplot(data=df, aes(residuals))+geom_histogram(binwidth=1)

ggplot(data=df, aes(sample=residuals))+geom_qq(size=0.5)+geom_qq_line(size=0.5, col="blue")

Diagnose: the residuals are right-skewed.
ggplot(data=df, aes(x=y.fit,y=residuals))+geom_jitter(alpha=.3)+geom_hline(yintercept=0, color="blue", linetype="solid", size=0.5)

Diagnose: the variability of the residuals is non-constant.
ggplot(data=df, aes(x=i, y=residuals))+geom_jitter(alpha=.3)+geom_hline(yintercept=0, color="blue", linetype="solid", size=0.5)

Diagnose: there is no increasing or decreasing pattern in the residuals, so the model does not appear to suffer from time-series structure.
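The same diagnostics can be produced in one step with base R's plot() method for lm objects, which draws residuals-vs-fitted, normal Q-Q, scale-location, and residuals-vs-leverage panels. This is a sketch of the base R alternative to the ggplot2 versions above.

```r
# Base R diagnostic plots for the fitted simple regression model.
par(mfrow = c(2, 2))  # 2x2 grid so all four panels show at once
plot(lm.fit)
par(mfrow = c(1, 1))  # restore the default single-panel layout
```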
# Model selection
lm.fit<-lm(medv~., data=Boston) # Full model where y: medv, x: all predictors in Boston
summary(lm.fit) # reports Std. Error, t value, p-value, R-squared, F-statistic
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
# Model selection using p-values

Backwards elimination: start with the full model, drop the variable with the highest p-value and refit a smaller model, then repeat until all variables left in the model are significant (p-value < alpha).
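A related but not identical automated approach is step(), which performs backward elimination using AIC rather than p-values, so it may retain or drop a slightly different set of variables than the manual p-value procedure followed below. A minimal sketch:

```r
# Automated backward elimination by AIC (not p-values) with step().
full <- lm(medv ~ ., data = Boston)
step.fit <- step(full, direction = "backward", trace = 0)  # trace = 0 suppresses the per-step log
formula(step.fit)  # formula of the model step() settles on
```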
lm.fit<-lm(medv~.-age, data=Boston) # The variable age is now removed from the model, as it has the highest p-value that is higher than the chosen significance level alpha of 0.05 in this case.
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ . - age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6054 -2.7313 -0.5188 1.7601 26.2243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.436927 5.080119 7.172 2.72e-12 ***
## crim -0.108006 0.032832 -3.290 0.001075 **
## zn 0.046334 0.013613 3.404 0.000719 ***
## indus 0.020562 0.061433 0.335 0.737989
## chas 2.689026 0.859598 3.128 0.001863 **
## nox -17.713540 3.679308 -4.814 1.97e-06 ***
## rm 3.814394 0.408480 9.338 < 2e-16 ***
## dis -1.478612 0.190611 -7.757 5.03e-14 ***
## rad 0.305786 0.066089 4.627 4.75e-06 ***
## tax -0.012329 0.003755 -3.283 0.001099 **
## ptratio -0.952211 0.130294 -7.308 1.10e-12 ***
## black 0.009321 0.002678 3.481 0.000544 ***
## lstat -0.523852 0.047625 -10.999 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.74 on 493 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7343
## F-statistic: 117.3 on 12 and 493 DF, p-value: < 2.2e-16
lm.fit<-lm(medv~.-age-indus, data=Boston) # The variables age and indus are now removed from the model
# Alternatively, the update() function can be used (e.g. update(lm.fit, ~.-age))
summary(lm.fit) # All remaining variables have p-values below alpha; the final model has 11 predictors.
##
## Call:
## lm(formula = medv ~ . - age - indus, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
The final equation is: \[ {medv}=36.34-0.11\times{crim}+0.05\times{zn}+2.72\times{chas}-17.38\times{nox}+3.80\times{rm}-1.49\times{dis}+0.30\times{rad}-0.01\times{tax}-0.95\times{ptratio}+0.01\times{black}-0.52\times{lstat} \]
attributes(summary(lm.fit))
## $names
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
##
## $class
## [1] "summary.lm"
summary(lm.fit)$adj.r.squared
## [1] 0.7348058
The fitted model has an adjusted R-squared of 0.73; in other words, roughly 73% of the variability in the response medv is explained by the model.
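As a check on this number, adjusted R-squared can be recomputed from its definition, adj.R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below assumes lm.fit is the final 11-predictor model fitted above.

```r
# Recompute adjusted R-squared from plain R-squared as a sanity check.
s <- summary(lm.fit)
n <- nrow(Boston)               # 506 observations
p <- length(coef(lm.fit)) - 1   # number of predictors (intercept excluded)
1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)  # should match s$adj.r.squared
```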
df<-data.frame(y=Boston$medv,
               y.fit=lm.fit$fitted.values,
               residuals=lm.fit$residuals,
               abs.residuals=abs(lm.fit$residuals),
               i=seq_along(Boston$lstat))

Multiple regression methods generally depend on the following four assumptions: the residuals of the model are nearly normal, the variability of the residuals is nearly constant, the residuals are independent, and each predictor is linearly related to the outcome. The plots below check these assumptions.
ggplot(data=df, aes(residuals))+geom_histogram(binwidth=1)

ggplot(data=df, aes(sample=residuals))+geom_qq(size=0.5)+geom_qq_line(size=0.5, col="blue")

Diagnose: the residuals are right-skewed with some outliers, but none are extreme.
ggplot(data=df, aes(x=y.fit,y=abs.residuals))+geom_jitter(alpha=.5)

Diagnose: three points have absolute residuals above 20, but they are not extreme; the variability is fairly constant.
ggplot(data=df, aes(x=i, y=residuals))+geom_jitter(alpha=.3)+geom_hline(yintercept=0, color="blue", linetype="solid", size=0.5)

Diagnose: we see no structure that would indicate a time-series issue.
Non-linear transformations of the predictor lstat:
lm.fit1<-lm(medv~lstat, data=Boston)
lm.fit2a<-lm(medv~lstat+I(lstat^2), data=Boston)
lm.fit2b<-lm(medv~I(lstat^2), data=Boston)
lm.fit3<-lm(medv~poly(lstat,3), data=Boston)
lm.fit4<-lm(medv~poly(lstat,4), data=Boston)
lm.fit5<-lm(medv~poly(lstat,5), data=Boston)
lm.fit6<-lm(medv~poly(lstat,6), data=Boston)
lm.fit7<-lm(medv~log(lstat), data=Boston)

We use the anova() function to quantify the extent to which a non-linear transformation fit is superior to the linear fit.
anova(lm.fit1,lm.fit2a)
## Analysis of Variance Table
##
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 504 19472
## 2 503 15347 1 4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The anova() function performs a hypothesis test comparing the two models. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior. Here the F-statistic is 135 and the p-value is virtually zero. This provides very clear evidence that the model containing the predictors lstat and lstat^2 is far superior to the model that only contains the predictor lstat. This is not surprising, since earlier we saw evidence for non-linearity in the relationship between medv and lstat.
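The F-statistic in the table can be recomputed by hand from the two residual sums of squares, which makes the test explicit: F = ((RSS1 - RSS2)/(df1 - df2)) / (RSS2/df2). A sketch, assuming lm.fit1 and lm.fit2a from above:

```r
# Manual nested-model F-test matching the anova() output above.
rss1 <- sum(resid(lm.fit1)^2)   # linear model, 504 residual df
rss2 <- sum(resid(lm.fit2a)^2)  # quadratic model, 503 residual df
((rss1 - rss2) / 1) / (rss2 / 503)  # about 135.2, as in the table
```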
anova(lm.fit1,lm.fit2a, lm.fit2b,lm.fit3,lm.fit4,lm.fit5,lm.fit6,lm.fit7)
## Analysis of Variance Table
##
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
## Model 3: medv ~ I(lstat^2)
## Model 4: medv ~ poly(lstat, 3)
## Model 5: medv ~ poly(lstat, 4)
## Model 6: medv ~ poly(lstat, 5)
## Model 7: medv ~ poly(lstat, 6)
## Model 8: medv ~ log(lstat)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 504 19472
## 2 503 15347 1 4125.1 151.8623 < 2.2e-16 ***
## 3 504 26180 -1 -10833.3 398.8158 < 2.2e-16 ***
## 4 502 14616 2 11565.1 212.8774 < 2.2e-16 ***
## 5 501 13968 1 647.8 23.8477 1.406e-06 ***
## 6 500 13597 1 370.7 13.6453 0.0002452 ***
## 7 499 13555 1 42.4 1.5596 0.2123125
## 8 504 14312 -5 -757.6 5.5779 5.170e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comparison of adjusted R-squared:
adjr<-c(summary(lm.fit1)$adj.r.squared,
summary(lm.fit2a)$adj.r.squared,
summary(lm.fit2b)$adj.r.squared,
summary(lm.fit3)$adj.r.squared,
summary(lm.fit4)$adj.r.squared,
summary(lm.fit5)$adj.r.squared,
summary(lm.fit6)$adj.r.squared,
summary(lm.fit7)$adj.r.squared)
adjr
## [1] 0.5432418 0.6392883 0.3858904 0.6558029 0.6704019 0.6785066 0.6788660
## [8] 0.6642814
Including additional polynomial terms improves the model fit, as the increasing adjusted R-squared demonstrates. However, no polynomial terms beyond the fifth order have significant p-values in a regression fit.
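Another single-number comparison across all eight candidate models is AIC, which penalizes extra terms; lower values are better. This is a sketch alongside the adjusted R-squared comparison, assuming the fitted objects from above.

```r
# AIC for each candidate model; lower is better.
sapply(list(lm.fit1, lm.fit2a, lm.fit2b, lm.fit3,
            lm.fit4, lm.fit5, lm.fit6, lm.fit7), AIC)
```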